INTERSPEECH.2015 - Analysis and Assessment

Total: 52

#1 Analysis of a low-dimensional bottleneck neural network representation of speech for modelling speech dynamics [PDF] [Copy] [Kimi1]

Authors: Linxue Bai ; Peter Jančovič ; Martin Russell ; Philip Weber

This paper presents an analysis of a low-dimensional representation of speech for modelling speech dynamics, extracted using bottleneck neural networks. The input to the neural network is a set of spectral feature vectors. We explore the effect of various design and training choices for the network, such as varying the size of the context in the input layer, the size of the bottleneck and other hidden layers, and using input reconstruction or phone posteriors as targets. Experiments are performed on TIMIT. The bottleneck features are employed in a conventional HMM-based phoneme recognition system, with a recognition accuracy of 70.6% achieved on the core test set using only 9-dimensional features. We also analyse how well the bottleneck features fit the assumptions of dynamic models of speech. Specifically, we employ the continuous-state hidden Markov model (CS-HMM), which considers speech as a sequence of dwell and transition regions. We demonstrate that the bottleneck features preserve trajectory continuity over time and can provide a suitable representation for the CS-HMM.
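
As a rough illustration of the kind of bottleneck network described above (not the authors' exact architecture), the following PyTorch sketch builds an autoencoder with a 9-dimensional bottleneck over stacked spectral context windows; the layer sizes, context width and optimiser settings are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Assumed dimensions (illustrative, not taken from the paper):
# 39-d spectral frames, +/-4 frames of context, 9-d bottleneck.
n_feats, context, bottleneck_dim = 39, 4, 9
in_dim = n_feats * (2 * context + 1)

encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.Tanh(),
                        nn.Linear(512, bottleneck_dim), nn.Tanh())
decoder = nn.Sequential(nn.Linear(bottleneck_dim, 512), nn.Tanh(),
                        nn.Linear(512, in_dim))
autoencoder = nn.Sequential(encoder, decoder)

optimiser = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(batch):
    """One update with input reconstruction as the target; phone posteriors
    could instead be used as targets by swapping the loss and decoder output."""
    optimiser.zero_grad()
    loss = loss_fn(autoencoder(batch), batch)
    loss.backward()
    optimiser.step()
    return loss.item()

# After training, encoder(batch) yields the low-dimensional bottleneck features
# that would feed a conventional HMM-based recogniser.
```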

#2 Statistical acoustic-to-articulatory mapping unified with speaker normalization based on voice conversion [PDF] [Copy] [Kimi1]

Authors: Hidetsugu Uchida ; Daisuke Saito ; Nobuaki Minematsu ; Keikichi Hirose

This paper proposes a model of speaker-normalized acoustic-to-articulatory mapping using statistical voice conversion. A mapping function from acoustic parameters to articulatory parameters is usually trained on a single speaker's parallel data. Hence the constructed mapping model works appropriately only for this specific speaker, and applying it to other speakers degrades the performance of acoustic-to-articulatory mapping. In this paper, two models, one for speaker conversion and one for acoustic-to-articulatory mapping, are implemented using Gaussian mixture models (GMMs), and by integrating them we propose two methods of speaker-normalized acoustic-to-articulatory mapping. One concatenates the models sequentially, while the other integrates them into a unified model in which the acoustic parameters of one speaker can be converted directly into the articulatory parameters of another speaker. Experiments show that both methods improve the mapping accuracy and that the latter works better than the former. In particular, for velar stop consonants the mapping accuracy improves by 0.6 mm.
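
For readers unfamiliar with GMM-based mapping, the sketch below shows the standard joint-density GMM regression that both components of such a system typically build on; it is a generic sketch (component count and data shapes are placeholders), not the proposed unified speaker-normalised model.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=32):
    """Fit a GMM on joint [source; target] vectors from parallel training data.
    X: (n_frames, dx) e.g. acoustic parameters; Y: (n_frames, dy) e.g. articulatory parameters."""
    return GaussianMixture(n_components=n_components,
                           covariance_type='full').fit(np.hstack([X, Y]))

def gmm_map(gmm, x, dx):
    """Minimum mean-square-error mapping E[y | x] under the joint GMM."""
    mu, S, w = gmm.means_, gmm.covariances_, gmm.weights_
    mu_x, mu_y = mu[:, :dx], mu[:, dx:]
    Sxx, Syx = S[:, :dx, :dx], S[:, dx:, :dx]
    # Responsibilities p(k | x) from the marginal GMM over the source features
    px = np.array([w[k] * multivariate_normal.pdf(x, mu_x[k], Sxx[k])
                   for k in range(len(w))])
    gamma = px / px.sum()
    # Mixture of component-wise conditional means
    y = np.zeros(mu_y.shape[1])
    for k in range(len(w)):
        y += gamma[k] * (mu_y[k] + Syx[k] @ np.linalg.solve(Sxx[k], x - mu_x[k]))
    return y
```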

#3 Analysis of features from analytic representation of speech using MP-ABX measures [PDF] [Copy] [Kimi1]

Authors: Raghavendra Reddy Pappagari ; Karthika Vijayan ; K. Sri Rama Murty

The significance of features derived from the complex analytic-domain representation of speech for different applications is investigated. Frequency domain linear prediction (FDLP) coefficients are derived from the analytic magnitude, and instantaneous frequency (IF) coefficients are derived from the analytic phase of speech signals. Minimal-pair ABX (MP-ABX) tasks are used to analyse the different features and develop insights into the nature of the information they carry. The performance of the features derived from the analytic representation is compared with that of the mel-frequency cepstral coefficients (MFCC). The magnitude-based features, FDLP and MFCC, delivered promising PaC, PaT and CaT scores in the MP-ABX tasks, demonstrating their phoneme discrimination abilities. Combining FDLP features with MFCC proved beneficial in phoneme discrimination tasks. The IF features performed well in the TaP mode of the MP-ABX tasks, emphasizing the speaker-specific information they contain, and significantly outperformed FDLP, MFCC and their combination in the speaker discrimination task.
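
A minimal sketch of the underlying analytic-signal decomposition (magnitude envelope and instantaneous frequency) using scipy; FDLP modelling of the envelope and the exact IF parameterisation used in the paper are not reproduced here.

```python
import numpy as np
from scipy.signal import hilbert

def analytic_decomposition(x, fs):
    """Split a (narrowband) signal into its analytic magnitude and instantaneous frequency."""
    z = hilbert(x)                                    # analytic signal: x + j * Hilbert{x}
    envelope = np.abs(z)                              # analytic magnitude
    phase = np.unwrap(np.angle(z))                    # analytic phase
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)   # instantaneous frequency in Hz
    return envelope, inst_freq
```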

#4 Source-filter separation of speech signal in the phase domain [PDF] [Copy] [Kimi1]

Authors: Erfan Loweimi ; Jon Barker ; Thomas Hain

Deconvolution of the speech excitation (source) and vocal tract (filter) components through log-magnitude spectral processing is well established and has led to the well-known cepstral features used in a multitude of speech processing tasks. This paper presents a novel source-filter decomposition based on processing in the phase domain. We show that separation between source and filter in the log-magnitude spectrum is far from perfect, leading to loss of vital vocal tract information. It is demonstrated that the same task can be better performed by trend and fluctuation analysis of the phase spectrum of the minimum-phase component of speech, which can be computed via the Hilbert transform. Trend and fluctuation can be separated through low-pass filtering of the phase, using the additivity of vocal tract and source in the phase domain. This results in separated signals which have a clear relation to the vocal tract and excitation components. The effectiveness of the method is put to the test in a speech recognition task: the vocal tract component extracted in this way is used as the basis of a feature extraction algorithm for speech recognition on the Aurora-2 database. The recognition results show up to an 8.5% absolute improvement over MFCC features on average (0-20 dB).
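
The following sketch illustrates the general idea of obtaining the phase spectrum of the minimum-phase component via the Hilbert transform of the log-magnitude spectrum, then splitting it into a slowly varying trend and a fluctuation; the Savitzky-Golay smoother stands in for the paper's low-pass filter, and the analysis parameters are assumptions.

```python
import numpy as np
from scipy.signal import hilbert, savgol_filter

def minimum_phase_trend(frame, n_fft=512, window=51, polyorder=3):
    """Phase of the minimum-phase component and a trend/fluctuation split of it."""
    log_mag = np.log(np.abs(np.fft.fft(frame, n_fft)) + 1e-12)
    # For a minimum-phase system, phase = -Hilbert{log-magnitude}
    min_phase = -np.imag(hilbert(log_mag))
    # Smoothing along frequency: the slow trend relates to the vocal tract,
    # the remaining fluctuation to the excitation (source)
    trend = savgol_filter(min_phase, window, polyorder)
    fluctuation = min_phase - trend
    return trend, fluctuation
```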

#5 A maximum likelihood approach to the detection of moments of maximum excitation and its application to high-quality speech parameterization [PDF] [Copy] [Kimi1]

Authors: Ranniery Maia ; Yannis Stylianou ; Masami Akamine

This paper presents an algorithm to detect moments of maximum excitation (MME) in speech. It assumes a model in which speech can be represented as a sequence of pulses located at the MME, convolved with a time-varying minimum-phase impulse response. By considering that within a glottal cycle speech concentrates more energy at the MME than at other instants, the locations and amplitudes of the excitation pulses are determined through maximum likelihood estimation. The suggested approach provides a fully automatic and consistent method for the detection of MME in speech without relying on ad hoc procedures, which usually do not work well across different speech styles without a considerable amount of adjustment. Experiments with speech parameterization, in the context of complex cepstrum analysis and synthesis, show that the proposed MME-based processing can improve the signal-to-error reconstruction ratio by up to 10% compared to the use of glottal closure instant estimates provided by a well-known algorithm.

#6 SABR: sparse, anchor-based representation of the speech signal [PDF] [Copy] [Kimi1]

Authors: Christopher Liberatore ; Sandesh Aryal ; Zelun Wang ; Seth Polsley ; Ricardo Gutierrez-Osuna

We present SABR (Sparse, Anchor-Based Representation), an analysis technique to decompose the speech signal into speaker-dependent and speaker-independent components. Given a collection of utterances for a particular speaker, SABR uses the centroid for each phoneme as an acoustic “anchor,” then applies Lasso regularization to represent each speech frame as a sparse non-negative combination of the anchors. We illustrate the performance of the method on a speaker-independent phoneme recognition task and a voice conversion task. Using a linear classifier, SABR weights achieve significantly higher phoneme recognition rates than mel-frequency cepstral coefficients. SABR weights can also be used directly to perform accent conversion without the need to train a speaker-to-speaker regression model.
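
A sketch of the sparse, non-negative anchor decoding step, using scikit-learn's Lasso with a positivity constraint; the regularisation weight and the anchor construction are placeholders rather than the paper's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sabr_weights(frame, anchors, alpha=0.1):
    """Sparse, non-negative weights expressing one spectral frame over phoneme anchors.

    frame:   (d,) feature vector for one frame
    anchors: (d, n_phonemes) matrix whose columns are per-phoneme centroids
    """
    lasso = Lasso(alpha=alpha, positive=True, fit_intercept=False, max_iter=5000)
    lasso.fit(anchors, frame)   # solves min ||frame - anchors @ w||^2 + alpha * ||w||_1, w >= 0
    return lasso.coef_
```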

#7 Automatic transformation of irregular to regular voice by residual analysis and synthesis [PDF] [Copy] [Kimi1]

Authors: Tamás Gábor Csapó ; Géza Németh

This paper presents an automatic method for transforming non-ideal phonation (irregular or creaky voice) into regular voice. The irregular-to-regular transformation is performed by analyzing and resynthesizing the residual. A recent continuous pitch estimation algorithm is used to interpolate F0 in regions of irregular voice. The linear prediction residual of irregular sections of speech is replaced by overlap-added frames from a codebook of pitch-synchronous residuals. Finally, speech is reconstructed from the residual. A listening experiment showed that transforming natural speech samples containing irregular voice decreases the perceived roughness of the speech.
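
The residual analysis/synthesis loop at the core of such a method can be sketched with librosa and scipy as follows; the codebook replacement of residual frames and the continuous F0 interpolation are omitted, and the LPC order is an assumption.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def residual_analysis_synthesis(frame, lpc_order=16):
    """Inverse-filter a speech frame to obtain the LP residual, then resynthesise from it.
    In the full method, residual frames in irregular regions would be replaced by
    pitch-synchronous codebook residuals before this resynthesis step."""
    a = librosa.lpc(frame.astype(float), order=lpc_order)   # [1, a1, ..., ap]
    residual = lfilter(a, [1.0], frame)                      # prediction error (excitation estimate)
    reconstruction = lfilter([1.0], a, residual)             # speech rebuilt from the residual
    return residual, reconstruction
```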

#8 Optical sensor calibration for electro-optical stomatography [PDF] [Copy] [Kimi1]

Authors: Simon Preuß ; Peter Birkholz

We are currently developing a technology called “electro-optical stomatography” to measure and visualize articulatory movements within the vocal tract using electrical contact sensors and optical proximity sensors. To measure tongue movements with the optical sensors in this system, a mapping between the raw sensor values and actual tongue positions has to be determined. This mapping is non-linear and different for every tongue and sensor. The lack of an accurate, reliable calibration method has so far prevented widespread use of optical measurements within the vocal tract. Here, we present a calibration method based on a multi-linear regression model that maps the sensor value at a single distance of 0 mm to calibration values at 0, 5, 10, 15, 20, 25, and 30 mm. The coefficients of the model are determined by a least-squares regression on 25 training data sets (recorded with 5 subjects and 5 sensors). Evaluation in a leave-one-out cross-validation and on five more data sets, recorded with a different subject on 5 additional sensors, yields very good results with maximum median position errors close to 1 mm. The calibration of the optical sensors can therefore be accomplished semi-automatically from a single, easily obtainable measurement during direct tongue contact.
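
A minimal sketch of a multi-output least-squares calibration model of the kind described, mapping the single contact (0 mm) reading to the full set of reference-distance values; the data arrays are random placeholders, and the exact regressors used in the paper may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Reference distances at which calibration values are predicted (mm)
distances = np.array([0, 5, 10, 15, 20, 25, 30])

# Placeholder training data: one row per (subject, sensor) training set,
# containing the raw reading at tongue contact (0 mm) and the readings at all distances.
contact_readings = np.random.rand(25, 1)                   # shape (25, 1)
calibration_curves = np.random.rand(25, len(distances))    # shape (25, 7)

# Multi-output linear regression fitted by least squares
model = LinearRegression().fit(contact_readings, calibration_curves)

def calibrate(raw_contact_value):
    """Predict the sensor response at every reference distance from a single contact reading."""
    return model.predict([[raw_contact_value]])[0]
```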

#9 From text to formants — indirect model for trajectory prediction based on a multi-speaker parallel speech database [PDF] [Copy] [Kimi1]

Authors: Kálmán Abari ; Tamás Gábor Csapó ; Bálint Pál Tóth ; Gábor Olaszy

An indirect model is presented, capable of estimating formant trajectories from text only (Text-to-Formants, TTF). The result is a phonetically correct formant trajectory flow of any virtual speech signal, i.e. one that has never been uttered. The focus is on the pattern forms inside a given sound, taking into account the sound environment (up to quinphones), and not on individual formant value measurements. The model is based on a multi-speaker parallel speech database with precise manual corrections and an HMM-based formant trajectory predictor. The validation of the TTF model shows that formant trajectories can be predicted from text with good accuracy. The model indirectly gives information about a theoretically possible articulation flow of the sentence, and thus provides a general 'formantprint' of the language.

#10 Layered nonnegative matrix factorization for speech separation [PDF] [Copy] [Kimi1]

Authors: Chung-Chien Hsu ; Jen-Tzung Chien ; Tai-Shih Chi

This paper proposes a layered nonnegative matrix factorization (L-NMF) algorithm for speech separation. The standard NMF method extracts parts-based bases from nonnegative training data and is often used to separate mixed spectrograms. The proposed L-NMF algorithm comprises several layers of standard NMF blocks. During training, each layer of the L-NMF is initialized separately and then fine-tuned by minimizing the propagated reconstruction error. More complex bases of the training data emerge in deeper layers of the L-NMF by progressively combining the parts-based bases extracted in the first layer. In other words, these complex bases contain collective information from the parts-based bases. The bases obtained from all layers are then used to separate spectrograms in the conventional NMF way. Simulation results show that the proposed L-NMF outperforms the standard NMF in terms of the source-to-distortion ratio (SDR).
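
A bare-bones sketch of stacking NMF layers, where each layer factorises the activations of the previous one; the layer ranks are arbitrary, and the paper's layer-wise initialisation and fine-tuning by propagated reconstruction error are not reproduced.

```python
import numpy as np
from sklearn.decomposition import NMF

def layered_nmf_bases(V, ranks=(80, 40)):
    """Stack NMF layers on a non-negative spectrogram V of shape (n_freq, n_frames).
    Returns the deepest layer's bases mapped back to the frequency axis."""
    X = V.T                                    # sklearn factorises X ~ W H with rows as samples (frames)
    layer_bases = []
    for r in ranks:
        nmf = NMF(n_components=r, init='nndsvda', max_iter=400)
        W = nmf.fit_transform(X)               # activations, shape (n_frames, r)
        layer_bases.append(nmf.components_)    # bases of this layer, shape (r, prev_dim)
        X = W                                  # the next layer factorises these activations
    deep = layer_bases[0]
    for H in layer_bases[1:]:                  # compose: V.T ~ W_last (H_last ... H_1)
        deep = H @ deep
    return deep                                # shape (ranks[-1], n_freq)
```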

#11 Robust tongue tracking in ultrasound images: a multi-hypothesis approach [PDF] [Copy] [Kimi1]

Authors: Catherine Laporte ; Lucie Ménard

Ultrasound (US) imaging is an excellent means of observing tongue motion during speech. Tracking the tongue contour in US video is required for analysis of this motion, but most currently available techniques suffer from either a lack of temporal consistency or a lack of robustness to difficult conditions such as a rapidly deforming tongue or momentarily poor image quality. This paper proposes a new algorithm combining active contours, active shape models and particle filtering that addresses these shortcomings. The strength of this approach lies in the fact that it maintains multiple tongue shape hypotheses simultaneously. Experimental results show that this approach outperforms a classic active contour algorithm as well as a shape-constrained variant thereof, particularly in difficult tracking conditions.

#12 Objective measures for predicting the intelligibility of spectrally smoothed speech with artificial excitation [PDF] [Copy] [Kimi1]

Authors: Danny Websdale ; Thomas Le Cornu ; Ben Milner

A study is presented on how well objective measures of speech quality and intelligibility can predict the subjective intelligibility of speech that has undergone spectral envelope smoothing and simplification of its excitation. Speech modifications are made by resynthesising speech that has been spectrally smoothed. Objective measures applied to the modified speech include measures of speech quality, signal-to-noise ratio and intelligibility, together with the proposed normalised frequency-weighted spectral distortion (NFD) measure. The measures are compared to subjective intelligibility scores, where it is found that several have high correlation (|r| ≥ 0.7), with NFD achieving the highest correlation (r = -0.81).

#13 Vocal tremor analysis via AM-FM decomposition of empirical modes of the glottal cycle length time series [PDF] [Copy] [Kimi1]

Authors: Christophe Mertens ; Francis Grenez ; François Viallet ; Alain Ghio ; Sabine Skodda ; Jean Schoentgen

The presentation concerns a method that obtains the size and frequency of vocal tremor in speech sounds sustained by normal speakers and by patients suffering from neurological disorders. The glottal cycle lengths are tracked in the temporal domain via salience analysis and dynamic programming. The cycle length time series is then decomposed into a sum of oscillating components by empirical mode decomposition, and the instantaneous envelopes and frequencies of these components are obtained via an AM-FM decomposition. Based on their average instantaneous frequencies, the empirical modes are assigned to four categories (intonation, physiological tremor, neurological tremor and jitter) and summed within each category. The within-category size of the cycle length perturbations is estimated as the standard deviation of the empirical mode sum divided by the average cycle length. The tremor frequency within the neurological tremor category is obtained via a weighted instantaneous average of the mode frequencies followed by a weighted temporal average. The method is applied to two corpora of sustained vowels produced by 123 and 74 control speakers and 456 and 205 Parkinson speakers, respectively.

#14 Estimating lower vocal tract features with closed-open phase spectral analyses [PDF] [Copy] [Kimi1]

Authors: Elizabeth Godoy ; Nicolas Malyska ; Thomas F. Quatieri

Previous studies have shown that, in addition to being speaker-dependent yet context-independent, lower vocal tract acoustics significantly impact the speech spectrum at mid-to-high frequencies (e.g. 3-6 kHz). The present work automatically estimates spectral features that exhibit acoustic properties of the lower vocal tract. Specifically aiming to capture the cyclicity property of the epilarynx tube, a novel multi-resolution approach to spectral analysis is presented that exploits significant differences between the closed and open phases of a glottal cycle. A prominent null linked to the piriform fossa is also estimated. Examples of the feature estimation on natural speech from the VOICES multi-speaker corpus illustrate that a salient spectral pattern indeed emerges between 3 and 6 kHz across all speakers. Moreover, the observed pattern is consistent with that canonically shown for the lower vocal tract in previous works. Additionally, an instance of a speaker's formant (i.e. the spectral peak around 3 kHz that is well established as a characteristic of voice projection) is quantified here for the VOICES template speaker in relation to epilarynx acoustics. The corresponding peak is shown to be double the power on average compared to the other speakers (20 vs. 10 dB).

#15 Inductive implementation of segmental HMMs as CS-HMMs [PDF] [Copy] [Kimi1]

Authors: S. M. Houghton ; Colin J. Champion

Segmental models have been used in speech recognition to reduce the effect of the counterfactual assumptions of statistical independence that are made in more conventional systems. They have achieved this aim at the cost of a large increase in computational load, arising from making assumptions on entire segments rather than on individual frames. In this paper we show how segmental algorithms can be refactored as iterative calculations, removing most of the additional computational burden they impose. We also show that the iterative implementation leads naturally to increased flexibility in the handling of timing, allowing an arbitrary timing model to be incorporated at no extra cost.

#16 A discriminative analysis within and across voiced and unvoiced consonants in neutral and whispered speech in multiple Indian languages [PDF] [Copy] [Kimi1]

Authors: G. Nisha Meenakshi ; Prasanta Kumar Ghosh

Whispered speech lacks the vocal cord vibration that is typically used to distinguish voiced from unvoiced consonants, making their discrimination a challenging task. In this work, we objectively and subjectively quantify the amount of discrimination between a voiced (V) consonant and its unvoiced (UV) counterpart using seven V-UV consonant pairs in six Indian languages, in neutral and whispered speech. We also quantify the extent to which the voicing characteristics of a consonant change from neutral to whispered speech. Experiments using vowel-consonant-vowel (VCV) stimuli demonstrate that the V-UV discrimination reduces from neutral to whispered speech in a consonant-specific manner, with the highest reduction for the /ɡ/-/k/ pair and the least for the /z/-/s/ pair. Interestingly, this reduction in objectively measured discrimination does not directly correlate with the reduction in the V-UV classification accuracy obtained from subjective evaluation. Results from the listening test show that the maximum and minimum reductions in V-UV classification accuracy occur for the /ʤ/-/ʧ/ and /v/-/f/ pairs when whispered. Whispered Tamil and Telugu VCV stimuli achieve the highest (85.71%) and lowest (58.93%) subjective V-UV classification accuracy, respectively, demonstrating the variability in the production and perception of whispered consonants across languages.

#17 Aligning meeting recordings via adaptive fingerprinting [PDF] [Copy] [Kimi1]

Authors: T. J. Tsai ; Andreas Stolcke

This paper proposes a robust and efficient way to temporally align a set of unsynchronized meeting recordings, such as might be collected by participants' cell phones. We propose an adaptive audio fingerprint which is learned on-the-fly in a completely unsupervised manner to adapt to the characteristics of a given set of unaligned recordings. The design of the adaptive audio fingerprint is formulated as a series of optimization problems which can be solved very efficiently using eigenvector routines. We also propose a method of aligning sets of files which uses the cumulative evidence from previous alignments to help align the weakest matches. Based on challenging alignment scenarios extracted from the ICSI meeting corpus, the proposed alignment system is able to achieve > 99% alignment accuracy at a 100ms error tolerance.

#18 On representation learning for artificial bandwidth extension [PDF] [Copy] [Kimi1]

Authors: Matthias Zöhrer ; Robert Peharz ; Franz Pernkopf

Recently, sum-product networks (SPNs) showed convincing results on the ill-posed task of artificial bandwidth extension (ABE). However, SPNs are just one of many architectures which can be summarized as representational models. In this paper, using ABE as the benchmark task, we perform a comparative study of Gauss-Bernoulli restricted Boltzmann machines, conditional restricted Boltzmann machines, higher order contractive autoencoders, SPNs and generative stochastic networks (GSNs). The latter in particular are promising architectures in terms of their reconstruction capabilities. Our experiments show impressive results for GSNs, which achieve average improvements of 3.90 dB and 4.08 dB in segmental SNR over SPNs in a speaker-dependent (SD) and a speaker-independent (SI) scenario, respectively.
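
Since the comparison above is reported in segmental SNR, a conventional segmental-SNR routine is sketched below for reference; the frame length, clamping range and alignment handling are common defaults rather than the paper's exact evaluation settings.

```python
import numpy as np

def segmental_snr(reference, estimate, frame_len=256, clamp=(-10.0, 35.0), eps=1e-10):
    """Mean per-frame SNR in dB between a reference signal and its reconstruction."""
    n = min(len(reference), len(estimate))
    snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        err = ref - estimate[start:start + frame_len]
        snr = 10.0 * np.log10((np.sum(ref ** 2) + eps) / (np.sum(err ** 2) + eps))
        snrs.append(np.clip(snr, *clamp))      # clamp silent / pathological frames
    return float(np.mean(snrs))
```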

#19 AM-FM based filter bank analysis for estimation of spectro-temporal envelopes and its application for speaker recognition in noisy reverberant environments [PDF] [Copy] [Kimi1]

Authors: Dhananjaya Gowda ; Rahim Saeidi ; Paavo Alku

In this paper, a new AM-FM based filter bank analysis for the estimation of the spectro-temporal envelope (STE) of speech signals is proposed. The filter bank is simulated by filtering a frequency-translated signal with a single resonator centered around the Nyquist frequency. The proposed design, using a single fixed resonator, provides distinct advantages over traditional filter bank designs. First, it provides a simple IIR filter with a smooth, ripple-free frequency response. Second, the bandwidth of the resonator can be easily controlled by the multiplicity of the poles and their proximity to the unit circle in the z-plane. Third, a resonator fixed at the highest possible center frequency provides the best separation between the AM and FM components of the filtered signal. Speaker recognition experiments on noisy and reverberant speech with short test segments show that the proposed AM-FM based filter bank analysis for STE estimation provides consistent improvements over a recently proposed discrete cosine transform based filter bank approach.
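
The core trick, translating the band of interest onto the Nyquist frequency and filtering with one fixed resonator whose bandwidth is set by pole multiplicity and radius, can be sketched roughly as follows; the pole radius, order and demodulation details are illustrative assumptions, not the paper's design.

```python
import numpy as np
from scipy.signal import lfilter

def am_fm_band(x, fs, f_center, pole_radius=0.97, n_poles=2):
    """AM envelope and instantaneous frequency (Hz) of the band around f_center,
    obtained by shifting that band onto the Nyquist frequency and applying a single
    resonator with repeated real poles at z = -pole_radius."""
    n = np.arange(len(x))
    w0 = 2.0 * np.pi * f_center / fs
    shift = np.pi - w0                                   # translation that maps f_center to Nyquist
    translated = x * np.exp(1j * shift * n)
    a = np.poly([-pole_radius] * n_poles)                # all-pole resonator centred at Nyquist
    y = lfilter([1.0], a, translated)
    y = y * np.exp(-1j * shift * n)                      # translate the filtered band back
    am = np.abs(y)                                       # AM (envelope) component
    fm = np.diff(np.unwrap(np.angle(y))) * fs / (2.0 * np.pi)   # FM component
    return am, fm
```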

#20 Fast and accurate phase unwrapping [PDF] [Copy] [Kimi1]

Authors: Thomas Drugman ; Yannis Stylianou

More and more speech technology and signal processing applications make use of phase information. Proper estimation and representation of the phase is inextricably tied to correct phase unwrapping, which refers to the problem of finding the instance of the phase function that ensures continuity. This paper proposes a new phase unwrapping technique based on two mathematical considerations: i) a property of the unwrapped phase at the Nyquist frequency, and ii) the modified Schur-Cohn algorithm, which allows a fast calculation of the distribution of a polynomial's roots with respect to the unit circle. The proposed method is compared to five state-of-the-art phase unwrappers on a large dataset of both synthetic random and real speech signals. By leveraging the two aforementioned considerations, the proposed approach is shown to perform an exact estimation of the unwrapped phase at a reduced computational load.
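
To make the two ingredients concrete, the sketch below shows (i) the standard dense-grid unwrapping that such methods improve upon and (ii) a brute-force count of the zeros of the frame polynomial outside the unit circle, the quantity a Schur-Cohn-type test delivers cheaply; the paper's actual method and its Nyquist-phase property are not re-implemented here.

```python
import numpy as np

def unwrapped_phase_baseline(frame, oversample=8):
    """Conventional approach: evaluate the spectrum on a dense grid and unwrap bin-to-bin."""
    spectrum = np.fft.rfft(frame, oversample * len(frame))
    return np.unwrap(np.angle(spectrum))

def zeros_outside_unit_circle(frame):
    """Brute-force root distribution of X(z) = sum_n x[n] z^{-n} with respect to the unit circle.
    The paper obtains this count without explicit root-finding via a modified Schur-Cohn test."""
    roots = np.roots(frame)        # zeros of the associated polynomial (poles at the origin ignored)
    return int(np.sum(np.abs(roots) > 1.0))
```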

#21 Sparse representation with temporal max-smoothing for acoustic event detection [PDF] [Copy] [Kimi1]

Authors: Xugang Lu ; Peng Shen ; Yu Tsao ; Chiori Hori ; Hisashi Kawai

In order to incorporate long-span temporal-frequency structure for acoustic event detection, we previously proposed a spectral-patch-based learning and representation method. The learned spectral patches were regarded as acoustic words, which were further used in sparse encoding for acoustic feature representation and modeling. In our previous study, each spectral patch was encoded independently during the feature encoding stage. Considering that spectral patches taken from a time sequence should keep similar representations for neighboring patches after encoding, in this study we propose to enhance the temporal correlation of the feature representation using a temporal max-smoothing algorithm. The max-smoothing picks the maximum response in a local time window as the representative feature for the detection task. We tested the new feature on the automatic detection of acoustic events selected from lecture audio data. Experimental results showed that temporal max-smoothing significantly improved the performance.
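
The temporal max-smoothing step itself is simple to express; a version operating on a matrix of sparse activations is sketched below (the window length and boundary handling are placeholders).

```python
import numpy as np
from scipy.ndimage import maximum_filter1d

def temporal_max_smooth(codes, window=5):
    """Replace each sparse-code activation by the maximum response within a local time window.

    codes: (n_atoms, n_frames) non-negative activation matrix from sparse encoding;
    smoothing along the frame axis keeps neighbouring patches' representations similar."""
    return maximum_filter1d(codes, size=window, axis=1, mode='nearest')
```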

#22 Estimation of glottal closure instants from telephone speech using a group delay-based approach that considers speech signal as a spectrum [PDF] [Copy] [Kimi1]

Authors: G Anushiya Rachel ; P Vijayalakshmi ; T Nagarajan

Glottal closure instants (GCIs) are characterized by a strong negative valley in the speech signal and an abrupt change in amplitude. In this paper, an algorithm that exploits these two properties of a GCI is proposed to estimate the locations of GCIs, specifically from telephone speech. The algorithm considers a symmetrized voiced segment as the Fourier transform of an even signal. In such a case, the negative valleys in the spectrum correspond to zeros that lie outside the unit circle in the z-plane, and the angular locations of these zeros indicate the locations of the GCIs. The angular locations can be estimated from the group delay spectrum of the even signal, since a phase change of 2π between adjacent frequency bins occurs at the location of a zero that lies outside the unit circle. The performance of the algorithm is evaluated on simulated speech corpora derived from the CMU and CSTR databases and on the NTIMIT database, in terms of identification, false alarm, and miss rates. The proposed algorithm is compared with DYPSA, YAGA, and SEDREAMS, and is found to outperform all of them when used on telephone speech.
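
For reference, the group delay spectrum used in this family of methods can be computed without explicit phase unwrapping via the standard formulation below; this is the generic computation only, not the paper's full GCI detection pipeline.

```python
import numpy as np

def group_delay(frame, n_fft=1024, eps=1e-12):
    """Group delay tau(w) = -d(arg X)/dw of a frame, via the n*x[n] identity."""
    x = np.asarray(frame, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)               # DFT of n * x[n]
    return np.real(X.conj() * Y) / (np.abs(X) ** 2 + eps)
```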

#23 The role of prosody and voice quality in text-dependent categories of storytelling across languages [PDF] [Copy] [Kimi1]

Authors: Raúl Montaño ; Francesc Alías

In contrast to full-blown emotions, storytelling speech entails a particular speaking style that contains subtle expressive nuances of which little is known. In the present work, we study the role of prosody and voice quality while searching for cross-linguistic acoustic similarities in two categories of storytelling speech that are defined by their lexical components: the descriptive mode and sentences that specify a character intervention, together with a third, neutral category (perceptually validated as a reference). The study considers four narrators telling the same story in four different European languages (English, French, German and Spanish). After conducting several statistical and discriminant analyses, we find that all narrators under analysis exploit some acoustic parameters in a similar way to differentiate among the analysed storytelling categories. Specifically, we observe that three prosodic features (mean fundamental frequency, mean intensity and number of silent pauses) and two voice quality parameters (mean Harmonic-to-Noise Ratio and Maxima Dispersion Quotient) explain a relatively similar proportion of the variance among storytelling categories in all languages. Moreover, the classification results obtained from the discriminant analysis are comparable for the three considered storytelling categories across languages.

#24 Neuromorphic based oscillatory device for incremental syllable boundary detection [PDF] [Copy] [Kimi1]

Authors: Alexandre Hyafil ; Milos Cernak

Syllables are considered basic supra-segmental units, used mainly in prosodic modelling. It has long been thought that efficient syllabification algorithms may also provide valuable cues for improved segmental (acoustic) modelling. However, the best current syllabification methods work offline, considering the power envelope of the whole utterance. In this paper we introduce a new method for detecting syllable boundaries based on a model of how speech is parsed into syllables by neural oscillations in the human auditory cortex. Neural oscillations automatically lock to the slow fluctuations of speech that convey the syllabic rhythm. Just as humans encode speech incrementally, i.e. without considering future temporal context, the proposed method also works incrementally. In addition, it is highly robust to noise. Syllabification performance for English under different noise conditions was compared to the existing Mermelstein and group delay algorithms. While the performance of the existing methods depends on the type of noise and the signal-to-noise ratio, the performance of the proposed method is constant across all noise conditions.

#25 Analyzing speech rate entrainment and its relation to therapist empathy in drug addiction counseling [PDF] [Copy] [Kimi2]

Authors: Bo Xiao ; Zac E. Imel ; David C. Atkins ; Panayiotis G. Georgiou ; Shrikanth S. Narayanan

A key quality index in drug addiction counseling, such as Motivational Interviewing, is the degree of the therapist's empathy towards the client. Empathy ratings are meant to evaluate the therapist's understanding of the patient's feelings, as reflected in the sensitivity and care of their responses. Empathy is also associated with the manifestation of behavioral entrainment in the interaction. In this paper, we compute a measure of entrainment in speech rate during dyadic interactions and investigate its relation to perceived empathy. We show that the averaged absolute difference of turn-level speech rates between the therapist and the patient correlates with ratings of therapist empathy. We also present the correlation of empathy with statistics of speech and silence durations. Finally, we show that in the task of automatically predicting high or low empathy, speech rate cues provide complementary information to previously proposed prosodic cues. These findings suggest speech rate is an important behavioral cue that is modulated by entrainment and contributes to empathy modeling.
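
One plausible reading of the turn-level measure described above is sketched below: the mean absolute difference between therapist and patient speech rates over paired adjacent turns; the pairing convention and the units (e.g. syllables per second) are assumptions, not taken from the paper.

```python
import numpy as np

def speech_rate_entrainment(therapist_rates, patient_rates):
    """Averaged absolute difference of turn-level speech rates between the two speakers.
    Lower values indicate stronger entrainment (more closely matched speech rates)."""
    t = np.asarray(therapist_rates, dtype=float)
    p = np.asarray(patient_rates, dtype=float)
    n = min(len(t), len(p))                    # pair consecutive therapist/patient turns
    return float(np.mean(np.abs(t[:n] - p[:n])))
```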